And the idea is that we can actually predict how good an attribute is going to be by looking
at the information gain in the answer.
The general answer we are after is the answer to the question: will I wait or won't I wait?
In our example that is essentially a one-bit question, because the example set contains six examples where we will wait and six where we won't wait.
So the prior is the same as for a fair, unloaded coin, and answering the question therefore gives us exactly one bit of information.
So at the beginning we have one bit of information to gain in the decision tree.
While developing the tree, we hope to get to nodes where only one value is possible.
At such a node we have no information left to gain.
It is basically the same as a coin that always falls on heads: answering the question "will it fall on heads?" carries no information, because we already know what it will do.
And if you think about the leaves in the tree, that's exactly the situation.
We know what to do there.
So somewhere between one bit of information to gain and zero bits of information to gain, each step of building the tree has to gain some information, and we would like that information gain to be high.
So the idea is that we define the information gain of an attribute as the difference between the information in the actual distribution at my node and the information I expect to remain after splitting on that attribute.
We are always comparing what we expect with what we actually have in our situation.
The remaining question is what "information" means, and we define it via the entropy, which is a standard formula: you take the prior probabilities, put a logarithm on them, weight them by the probability values, add them all up, and negate the sum.
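Written out (a standard rendering, using base-2 logarithms so the result is measured in bits), the entropy and the resulting gain of an attribute A are:

H(p_1, \dots, p_n) = -\sum_{i=1}^{n} p_i \log_2 p_i
\qquad
\mathrm{Gain}(A) = H(\text{node}) - \sum_{v \in \mathrm{values}(A)} \frac{n_v}{n}\, H(\text{examples with } A = v)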
So that gives us a measure, and that measure allows us to say the gain of patrons is much
higher than the gain of, say, type.
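To make this concrete, here is a small Python sketch (my own illustration, not code from the lecture); the split counts assume the standard restaurant example, where Patrons divides the twelve examples into None (0 wait / 2 won't), Some (4 / 0) and Full (2 / 4), and Type divides them 1/1, 1/1, 2/2, 2/2:

```python
# Sketch of the entropy / information-gain computation described above.
# Assumption: class distributions are given as lists of counts, e.g. [6, 6].
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Entropy at the node minus the expected entropy after the split."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - remainder

parent = [6, 6]  # six "wait" and six "won't wait" examples: one bit to gain

# Assumed splits for the two attributes (see the counts above):
print(information_gain(parent, [[0, 2], [4, 0], [2, 4]]))            # ~0.541 bits for Patrons
print(information_gain(parent, [[1, 1], [1, 1], [2, 2], [2, 2]]))    # 0.0 bits for Type
```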
So what do we do?
We make Patrons the root of the tree, because that attribute already answers more than half of the question.
And then we do whatever is necessary recursively for the children.
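That recursive step can be sketched roughly as follows (a minimal sketch with names of my own choosing; examples are (attribute-values, label) pairs, and choose_attribute is assumed to pick the attribute with the highest information gain, for instance using the gain function above):

```python
from collections import Counter

def learn_tree(examples, attributes, choose_attribute):
    """Minimal sketch of recursive decision-tree construction."""
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:          # only one value left: a leaf, zero bits to gain
        return labels[0]
    if not attributes:                 # no attributes left: fall back to the majority label
        return Counter(labels).most_common(1)[0][0]
    best = choose_attribute(examples, attributes)   # e.g. highest information gain
    tree = {best: {}}
    for value in {attrs[best] for attrs, _ in examples}:
        subset = [(a, l) for a, l in examples if a[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, remaining, choose_attribute)
    return tree
```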
So that's the idea.
And we see that we get a nice, small, little tree for that.
The tree could be much worse.
We might have expected a tree of up to depth 10, because we have 10 attributes.
Here we have depth 4.
Remember that trees grow exponentially with depth (even with only binary splits, depth 4 means at most 2^4 = 16 leaves, while depth 10 allows up to 2^10 = 1024), so the difference between depth 4 and depth 10 really matters.
So that's really the value of information here in this algorithm.
If you now look at how well this works, and consider the error rate or, in this case, the fraction of correct predictions that this decision tree learning algorithm makes on held-out test data as the number of examples varies, we see that we actually reach somewhere near 100%.
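A curve like that could be produced with a small evaluation harness along these lines (my own sketch, assuming the usual learning-curve setup of training on an increasing number of examples and measuring accuracy on the held-out rest; train_fn and predict_fn are hypothetical hooks for the learner above):

```python
import random

def learning_curve(examples, train_fn, predict_fn, sizes, trials=20):
    """Average fraction of correct predictions on held-out data per training size.

    Assumes every size in `sizes` is smaller than len(examples)."""
    curve = []
    for n in sizes:
        correct = total = 0
        for _ in range(trials):
            random.shuffle(examples)
            train, test = examples[:n], examples[n:]
            model = train_fn(train)
            correct += sum(predict_fn(model, x) == y for x, y in test)
            total += len(test)
        curve.append((n, correct / total))
    return curve   # list of (training-set size, fraction correct) points
```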
Recap: Using Information Theory (Part 1)
Main video on the topic in chapter 8 clip 5.